Regression analysis

Linear and logistic regression


Aspect | Linear regression | Logistic regression
-----------------------------------|------------------------------|------------------------------------
Response variable | Numerical | Categorical
Relationship between predictor estimate and response | Linear | Logistic (linear relationship between estimate and log odds of response)
Fitting function | OLS (ordinary least squares) | MLE (maximum likelihood estimation)
Model comparison | F-test | AIC
Evaluation metric | \(R^2\) | \(C\)
Base R function | lm() | glm()
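As a minimal sketch of the last row of the table (toy data, assumed for illustration only), the two base R calls look like this:

```r
# Toy data (assumed): one predictor, one numerical and one binary response
set.seed(1)
d <- data.frame(x = 1:10,
                y_num = (1:10) + rnorm(10),   # numerical response
                y_cat = rep(c(0, 1), 5))      # binary (categorical) response

lm(y_num ~ x, data = d)                        # linear regression, fitted by OLS
glm(y_cat ~ x, data = d, family = binomial)    # logistic regression, fitted by MLE
```

Note that glm() needs family = binomial to fit a logistic model; without it, glm() defaults to a Gaussian family and reproduces lm().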

Simple linear regression

Simple

one predictor variable

Linear

linear relation between estimated parameters and response variable

Notation

y ~ x

Estimation: OLS

Ordinary least squares, minimizing sum of squares of residuals
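For one predictor, the OLS estimates have a closed form that can be checked against lm(); a minimal sketch with toy numbers (not the utterance data):

```r
# Closed-form OLS for a single predictor (toy data, assumed):
# slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.0, 9.9)

slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# These match the coefficients returned by lm()
coef(lm(y ~ x))
```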

Example

library(ggplot2)
library(tibble)
set.seed(2022)

utt_lengths <- tibble(
  age = 3:17,
  utterance_length = 3 + 0.5*seq_along(age) + rnorm(length(age), sd = 0.3)
)
m <- lm(utterance_length ~ age, data = utt_lengths)
utt_lengths$fit <- m$fitted.values

g <- ggplot(utt_lengths, aes(x = age, y = utterance_length)) +
  labs(x = "Age", y = "Utterance length") +
  xlim(c(0,18)) + ylim(c(0,15)) +
  theme_minimal(base_size = 14) + theme(aspect.ratio = 1)

Plotted values

g <- g + geom_point()
g

Adding fitted line

g <- g + geom_line(aes(y = fit))
g

Residuals

g <- g + geom_segment(aes(xend = age, yend = fit))
g

Model output

summary(m)

Call:
lm(formula = utterance_length ~ age, data = utt_lengths)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.71722 -0.15356 -0.01602  0.17418  0.51755 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.69576    0.20355   8.331 1.43e-06 ***
age          0.51891    0.01869  27.770 5.84e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3127 on 13 degrees of freedom
Multiple R-squared:  0.9834,    Adjusted R-squared:  0.9821 
F-statistic: 771.2 on 1 and 13 DF,  p-value: 5.843e-13

Easystats

Quick reports in text

library(report)
report(m)
We fitted a linear model (estimated using OLS) to predict utterance_length with age (formula: utterance_length ~ age). The model explains a statistically significant and substantial proportion of variance (R2 = 0.98, F(1, 13) = 771.19, p < .001, adj. R2 = 0.98). The model's intercept, corresponding to age = 0, is at 1.70 (95% CI [1.26, 2.14], t(13) = 8.33, p < .001). Within this model:

  - The effect of age is statistically significant and positive (beta = 0.52, 95% CI [0.48, 0.56], t(13) = 27.77, p < .001; Std. beta = 0.99, 95% CI [0.91, 1.07])

Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using the Wald approximation.

Quickly printing estimates

library(parameters)
model_parameters(m)
Parameter   | Coefficient |   SE |       95% CI | t(13) |      p
----------------------------------------------------------------
(Intercept) |        1.70 | 0.20 | [1.26, 2.14] |  8.33 | < .001
age         |        0.52 | 0.02 | [0.48, 0.56] | 27.77 | < .001
print_md(model_parameters(m))
Parameter   | Coefficient |   SE |       95% CI | t(13) |      p
------------|-------------|------|--------------|-------|-------
(Intercept) |        1.70 | 0.20 | (1.26, 2.14) |  8.33 | < .001
age         |        0.52 | 0.02 | (0.48, 0.56) | 27.77 | < .001

Check and plot assumptions

library(performance)
check_heteroscedasticity(m)
OK: Error variance appears to be homoscedastic (p = 0.237).
check_model(m, check = c("qq", "ncv"))

Evaluate model

model_performance(m) %>% print_md()
Indices of model performance
  AIC |   BIC |   R2 | R2 (adj.) | RMSE | Sigma
11.54 | 13.67 | 0.98 |      0.98 | 0.29 |  0.31

Multiple linear regression

Multiple

more than one predictor

Notation

y ~ x1 + x2 (fitting a plane)

y ~ x1 + x2 + ... + xn (fitting a hyperplane)

Estimation: OLS

Ordinary least squares, minimizing sum of squares of residuals

Example

library(dplyr)

utt_lengths <- utt_lengths %>% 
  mutate(pages_read = 0 + 0.8*seq_along(age) + rnorm(length(age), sd = 1))

m2 <- lm(utterance_length ~ age + pages_read, data = utt_lengths)
utt_lengths$fit2 <- m2$fitted.values
head(utt_lengths) %>% knitr::kable()
age | utterance_length |      fit | pages_read |     fit2
----|------------------|----------|------------|---------
  3 |         3.770043 | 3.252489 |   2.393078 | 3.247895
  4 |         3.647996 | 3.771397 |   2.645829 | 3.772950
  5 |         4.230754 | 4.290305 |   3.507192 | 4.289808
  6 |         4.566650 | 4.809213 |   4.444730 | 4.805639
  7 |         5.400696 | 5.328121 |   4.297667 | 5.336080
  8 |         5.129811 | 5.847029 |   5.284364 | 5.851249

Example 3D

Categorical predictors

  • With 2 levels, e.g. “monolingual family”: yes/no

  • Translated into 1 dummy predictor with two levels

    • 0: reference level, e.g. no

    • 1: other level, e.g. yes

  • Line fitted between 0 and 1, slope is the difference in y when yes (compared to no)

Categorical predictors

  • With more than 2 levels, e.g. “L1”: EN, ES, FR…

  • Translated into n-1 dummy predictors with two levels

    • 1: one of the levels, not the reference level, e.g. ES

    • 0: the rest of the levels, including the reference level (e.g. EN)

  • (Hyper)plane fitted between combinations of 0 and 1; each slope is the difference in y between that level and the reference level.

Categorical predictors

  • Comparison is done between each level and the reference level, not in all combinations

  • t-test: does the slope of an individual dummy predictor differ from 0?

  • F-test of nested models: do the dummy predictors jointly reduce unexplained variation?
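A minimal sketch of the dummy coding and the nested-model F-test described above, with a hypothetical 3-level L1 factor and a made-up response y:

```r
# Hypothetical 3-level factor; EN (first level) is the reference level
set.seed(1)
d <- data.frame(L1 = factor(rep(c("EN", "ES", "FR"), each = 10)),
                y  = rnorm(30))

# model.matrix() shows the n-1 = 2 dummy predictors R creates: L1ES and L1FR
head(model.matrix(~ L1, data = d))

# F-test of nested models: do the dummies jointly reduce unexplained variation?
anova(lm(y ~ 1, data = d), lm(y ~ L1, data = d))
```

Each dummy column is 1 for its own level and 0 otherwise, so the reference level EN corresponds to both dummies being 0.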

Logistic regression

Logistic regression models the probability of success (i.e. of a certain outcome)

Probabilities are bounded between 0 and 1, so we cannot fit a simple straight line to them

Probabilities, odds and logit

Value | Range | Neutral value | Description
-----------------------------------------------------|---------------------------|---------------|------------
probability \(P\) | 0 to 1 | 0.5 | Number of successes divided by number of trials
odds \(\frac{P}{1-P}\) | 0 to \(\infty\) | 1 | Probability of success divided by the probability of failure. Undefined for \(P=1\)
logit (log odds) \(\log\left(\frac{P}{1-P}\right)\) | \(-\infty\) to \(\infty\) | 0 | If positive, success is more likely; if negative, failure is more likely. Undefined for \(P=0\) and \(P=1\)

Higher \(P\) -> higher odds -> higher logit
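These conversions can be checked in base R, which provides qlogis() (probability to logit) and plogis() (logit back to probability); a minimal sketch:

```r
P     <- 0.8
odds  <- P / (1 - P)   # 4: success is four times as likely as failure
logit <- log(odds)     # same value as qlogis(P)

stopifnot(all.equal(logit, qlogis(P)),   # base R logit function
          all.equal(P, plogis(logit)))  # inverse: logit back to probability
```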

Some examples

We’ll create a vector probabilities with the values of fractions from \(\frac{1}{7}\) to \(\frac{1}{2}\) and then from \(1-\frac{1}{3}\) to \(1-\frac{1}{7}\).

MASS::fractions() prints them as fractions.

From there we compute odds and logit.

library(MASS) # to print fractions

probabilities <- c(1/c(7:2), 1-(1/c(3:7)))
probs <- tibble(
  P = probabilities,
  P_frac = as.character(fractions(P)),
  odds = P/(1-P),
  odds_frac = as.character(fractions(odds)),
  logit = log(odds)
)

Some examples

P P_frac odds odds_frac logit
0.143 1/7 0.167 1/6 -1.792
0.167 1/6 0.200 1/5 -1.609
0.200 1/5 0.250 1/4 -1.386
0.250 1/4 0.333 1/3 -1.099
0.333 1/3 0.500 1/2 -0.693
0.500 1/2 1.000 1 0.000
0.667 2/3 2.000 2 0.693
0.750 3/4 3.000 3 1.099
0.800 4/5 4.000 4 1.386
0.833 5/6 5.000 5 1.609
0.857 6/7 6.000 6 1.792

Probabilities, odds, logit

lnplot(probs, P, logit)

lnplot(probs, P, odds)

Linear/logistic

Linear relation logit ~ x entails logistic curve p ~ x.

with_x <- tibble(x = 1:30, logit = -3.5 + 0.3*x,
                 odds = exp(logit), P = odds/(1+odds))
lnplot(with_x, x, logit)

lnplot(with_x, x, P)

Linear/logistic

Linear relation logit ~ x entails logistic curve p ~ x.

with_x2 <- tibble(x = 1:30, logit = 3.5 - 0.3*x,
                 odds = exp(logit), P = odds/(1+odds))
lnplot(with_x2, x, logit)

lnplot(with_x2, x, P)

Example

set.seed(0)
reading_habits <- tibble(
  L1 = factor(rep(c("EN", "ES"), each = 50), levels = c("EN", "ES")),
  age = rep(15:6, 10)) %>%
  mutate(
    logit = 0.2 -0.8*as.numeric(L1) + 0.2*age + rnorm(nrow(.)),
    odds = exp(logit),
    P = odds/(odds+1),
    reads_EN = factor(P >= 0.5, c("FALSE", "TRUE"))
    )

m3 <- glm(reads_EN ~ L1 + age, data = reading_habits, family = binomial(logit))
reading_habits$fit <- m3$fitted.values

reading_habits %>% head() %>% knitr::kable()

Example

L1 age logit odds P reads_EN fit
EN 15 3.662954 38.9763201 0.9749852 TRUE 0.9968727
EN 14 1.873767 6.5127816 0.8668935 TRUE 0.9946208
EN 13 3.329799 27.9327340 0.9654371 TRUE 0.9907623
EN 12 3.072429 21.5942985 0.9557410 TRUE 0.9841801
EN 11 2.014641 7.4980384 0.8823258 TRUE 0.9730353
EN 10 -0.139950 0.8694017 0.4650695 FALSE 0.9544032

Visual exploration

ggplot(reading_habits, aes(x = age, y = P, color = L1)) +
  geom_point(size = 3, alpha = 0.6) +
  scale_color_manual(values=c("#E69F00", "#56B4E9")) +
  theme_minimal(base_size = 20) +
  labs(x = "Age", y = "Probability of reading in English")

Default output

summary(m3)

Call:
glm(formula = reads_EN ~ L1 + age, family = binomial(logit), 
    data = reading_habits)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4851   0.1262   0.2338   0.5153   1.2084  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  -2.4052     1.4574  -1.650  0.09888 . 
L1ES         -0.9353     0.7086  -1.320  0.18687   
age           0.5446     0.1744   3.123  0.00179 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 73.385  on 99  degrees of freedom
Residual deviance: 56.258  on 97  degrees of freedom
AIC: 62.258

Number of Fisher Scoring iterations: 6

Easystats - logistic regression

Quick reports in text

report(m3)
We fitted a logistic model (estimated using ML) to predict reads_EN with L1 and age (formula: reads_EN ~ L1 + age). The model's explanatory power is moderate (Tjur's R2 = 0.19). The model's intercept, corresponding to L1 = EN and age = 0, is at -2.41 (95% CI [-5.52, 0.31], p = 0.099). Within this model:

  - The effect of L1 [ES] is statistically non-significant and negative (beta = -0.94, 95% CI [-2.43, 0.41], p = 0.187; Std. beta = -0.94, 95% CI [-2.43, 0.41])
  - The effect of age is statistically significant and positive (beta = 0.54, 95% CI [0.25, 0.94], p = 0.002; Std. beta = 1.57, 95% CI [0.71, 2.73])

Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using the Wald approximation.

Quickly printing estimates

print_md(model_parameters(m3))
Parameter   | Log-Odds |   SE |        95% CI |     z |     p
------------|----------|------|---------------|-------|------
(Intercept) |    -2.41 | 1.46 | (-5.52, 0.31) | -1.65 | 0.099
L1 (ES)     |    -0.94 | 0.71 | (-2.43, 0.41) | -1.32 | 0.187
age         |     0.54 | 0.17 |  (0.25, 0.94) |  3.12 | 0.002
plot(model_parameters(m3))

Check and plot assumptions

check_collinearity(m3)
# Check for Multicollinearity

Low Correlation

 Term  VIF       VIF 95% CI Increased SE Tolerance Tolerance 95% CI
   L1 1.01 [1.00, 62814.39]         1.01      0.99     [0.00, 1.00]
  age 1.01 [1.00, 62814.39]         1.01      0.99     [0.00, 1.00]
check_model(m3, check = c("binned_residuals", "vif", "outliers"))

Evaluate model

model_performance(m3, metrics = "common") %>% 
  mutate(C = Hmisc::somers2(
    reading_habits$fit,
    as.numeric(reading_habits$reads_EN == "TRUE")
  )[["C"]]
  ) %>% 
  print_md()
Indices of model performance
  AIC |   BIC | Tjur’s R2 | RMSE |    C
62.26 | 70.07 |      0.19 | 0.29 | 0.84

Extra code

Plot code

lnplot <- function(df, x, y) {
  # Line-and-point plot of y against x; {{ }} (tidy evaluation)
  # forwards bare column names into aes()
  ggplot(df, aes(x = {{ x }}, y = {{ y }})) +
    geom_line() +
    geom_point() +
    theme_minimal(base_size = 20) +
    theme(aspect.ratio = 1)
}